Scalable Inductive Learning on Partitioned Data
Authors
Abstract
With the rapid advancement of information technology, scalability has become a necessity for learning algorithms that must deal with large, real-world data repositories. In this paper, scalability is accomplished through a data reduction technique, which partitions a large data set into subsets, applies a learning algorithm to each subset sequentially or concurrently, and then integrates the learned results. Five strategies to achieve scalability (Rule-Example Conversion, Rule Weighting, Iteration, Good Rule Selection, and Data Dependent Rule Selection) are identified, and seven corresponding scalable schemes are designed and developed. A substantial number of experiments have been performed to evaluate these schemes. Experimental results demonstrate that, through data reduction, some of our schemes can effectively generate accurate classifiers from the weak classifiers learned on data subsets. Furthermore, our schemes require significantly less training time than generating a single global classifier does.
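The partition, learn, and integrate pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not one of the paper's seven schemes: the weak learner here is an assumed one-feature threshold rule (a decision stump), integration is done by plain majority voting rather than by any of the five named strategies, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def partition(examples, k):
    """Split the examples into k disjoint subsets (round-robin)."""
    return [examples[i::k] for i in range(k)]


def learn_stump(subset):
    """Assumed weak learner: the single-feature threshold rule with the
    fewest errors on this subset. Labels are taken to be 0 or 1."""
    best = None
    n_features = len(subset[0][0])
    for f in range(n_features):
        for x, _ in subset:
            t = x[f]
            # Try both orientations of the rule "x[f] <= t".
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if xi[f] <= t else right) != yi
                    for xi, yi in subset
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    _, f, t, left, right = best
    return lambda x: left if x[f] <= t else right


def integrate(classifiers):
    """Assumed integration step: unweighted majority vote."""
    def predict(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return predict


# Toy data (hypothetical): the label depends only on the first feature.
data = [((i / 300, (i * 7 % 13) / 13), int(i / 300 > 0.5)) for i in range(300)]

subsets = partition(data, 5)                 # data reduction
weak = [learn_stump(s) for s in subsets]     # one weak classifier per subset
ensemble = integrate(weak)                   # combined classifier

accuracy = sum(ensemble(x) == y for x, y in data) / len(data)
print(f"ensemble accuracy on the toy data: {accuracy:.3f}")
```

Each call to `learn_stump` sees only its own subset, so the calls are independent and could run sequentially or concurrently, as the abstract notes. An odd number of subsets avoids tied votes for two-class problems.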
Similar resources
Scalability of Learning Arbiter and Combiner Trees from Partitioned Data
Much of the research in inductive learning concentrates on problems with relatively small amounts of data residing at one location. In this paper we explore the scalability of learning arbiter and combiner trees from partitioned data. Arbiter and combiner trees integrate classifiers trained in parallel from small disjoint subsets. Previous work demonstrated their efficacy in terms of accuracy, thi...
Inductive Logic Programming meets Relational Databases: An Application to Statistical Relational Learning
With the increasing amount of relational data, scalable approaches to faithfully model this data have become increasingly important. Statistical Relational Learning (SRL) approaches have been developed to learn in the presence of noisy relational data by combining probability theory with first-order logic. However, most learning approaches for these models do not scale well to large datasets. While ...
A Comparative Evaluation of Voting and Meta-learning on Partitioned Data
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of very large network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible ...
Inductive Learning in Less Than One Sequential Data Scan
Most recent research on scalable inductive learning over very large datasets, decision tree construction in particular, focuses on eliminating memory constraints and reducing the number of sequential data scans. However, state-of-the-art decision tree construction algorithms still require multiple scans over the data set and use sophisticated control mechanisms and data structures. We first discus...
Parallel Inductive Logic for Data Mining
Data mining is the process of automatically extracting novel, useful, and understandable patterns from very large databases. High-performance, scalable, and parallel computing algorithms are crucial in data mining as datasets grow in size and complexity. Inductive logic is a research area at the intersection of machine learning and logic programming, which has recently been applied to data mining. ...